Introduction

In many biomedical and social science settings, observations are clustered and interventions are applied to whole units, so outcomes within a unit may be correlated with one another: for example, patients within hospitals or students within classrooms. This clustering complicates data analysis because it violates the traditional regression assumption that observations are independent. For cluster-randomized trials (CRTs) with continuous outcomes, linear mixed models (LMMs), which assume that observations are independent conditional on cluster membership, are the preferred approach to data analysis. CRTs are a widely used experimental design (see, for example, Moon et al. (2017), Vinereanu et al. (2017), and Huang et al. (2013)), and in that setting LMMs are an attractive option for data analysts because they flexibly accommodate nested levels of clustering and differing cluster sizes, and they are robust to data that are missing completely at random (MCAR) or missing at random (MAR). Generalized linear mixed models (GLMMs) extend the approach to non-normally distributed data, such as binary, count, or multinomial outcomes.

In a CRT, there are typically two assumed sources of variability in outcomes: between clusters, denoted here as \(\sigma^2_b\), and within clusters, denoted as \(\sigma^2\). One way of quantifying the amount of clustering is via the intracluster correlation coefficient (ICC) \(\rho\), defined as \(\rho = \frac{\sigma^2_b}{\sigma^2_b + \sigma^2}\), i.e., the proportion of total variance attributable to between-cluster variability. When analyzing the data with an LMM, standard errors for fixed-effect coefficients must be inflated relative to what an independence assumption would give. This adjustment, the variance inflation factor (VIF), depends on the ICC. If the clusters are large, even a very small ICC arising from a relatively small \(\sigma^2_b\) can generate a large VIF and meaningfully change inferences. For this reason, it is recommended that potential between-cluster variation be explicitly modeled in the (G)LMM data analysis. [citation needed]
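
For a balanced CRT with \(N\) subjects per cluster, the VIF takes the familiar design-effect form \(1 + (N - 1)\rho\). The following sketch (Python here purely for illustration; the study itself used R and SAS) shows how large clusters amplify even a tiny ICC:

```python
def design_effect(n_per_cluster, icc):
    """Variance inflation factor for a balanced CRT: 1 + (N - 1) * rho."""
    return 1.0 + (n_per_cluster - 1) * icc

# Even an ICC of 0.001 roughly doubles the variance of the treatment-effect
# estimate with 1,000 subjects per cluster, and inflates it roughly
# elevenfold with 10,000 subjects per cluster.
print(design_effect(1000, 0.001))   # ~2.0
print(design_effect(10000, 0.001))  # ~11.0
```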

Given the popularity and flexibility of (G)LMMs, they have been used in many different contexts and with widely varying sample sizes. In a sample of 140 recently published cluster-randomized trials in various medical journals, over 30% had fewer than 20 clusters, and the number of subjects per cluster varied from fewer than 5 to over 10,000. Most major statistical software packages implement algorithms for (G)LMM fitting, but inferences based on these algorithms usually rely on asymptotic assumptions, and their finite-sample performance has not been thoroughly characterized. [Summarize known small-sample results from simulation studies here, with citations, or note the lack of such results.]

Our original goal was to study the impact on Type I error (TIE) rates of allowing the cluster-level variance to differ by treatment arm in a CRT, but along the way we discovered that in many settings the TIE rates were not close to their nominal level regardless of whether the variance was allowed to differ. This work examines the realized TIE rates in a simulation study, using both model-based (Wald) and likelihood ratio test-based approaches. We show that major software packages often fail to meet the nominal level in small-sample settings, especially when the number of clusters is small and the number of subjects per cluster is large.

Study Design

We generated clustered, balanced data from the null model

\[ y_{ij} = b_{0i} + e_{ij} \]

for clusters \(i = 1, 2, ..., K\) and individuals \(j = 1, 2, ..., N\) within each cluster. The random intercept for cluster \(i\) was distributed \(b_{0i} \sim N(0, \sigma_b^2)\), and the residual error \(e_{ij} \sim N(0, \sigma^2)\).
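
As a minimal sketch of this data-generating mechanism (in Python rather than the R and SAS used in the study; `simulate_null` is a name introduced here for illustration):

```python
import random

def simulate_null(K, N, sigma_b2, sigma2=1.0, seed=None):
    """Draw one balanced null dataset: y_ij = b_0i + e_ij,
    with b_0i ~ N(0, sigma_b2) and e_ij ~ N(0, sigma2)."""
    rng = random.Random(seed)
    data = []
    for i in range(K):
        b0 = rng.gauss(0.0, sigma_b2 ** 0.5)  # cluster-level random intercept
        for _ in range(N):
            data.append((i, b0 + rng.gauss(0.0, sigma2 ** 0.5)))
    return data

dataset = simulate_null(K=10, N=5, sigma_b2=0.1, seed=1)
print(len(dataset))  # 50 observations: 10 clusters of 5
```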

We then fit models that assumed the clusters were evenly divided into two arms, with treatment indicator \(x_{ij} \in \{0, 1\}\) for control and treatment, allowing for clustering:

\[ y_{ij} = \beta_0 + \beta_1 x_{ij} + b_{0i} + e_{ij} \]

We gathered p-values for the \(\beta_1\) coefficient and, since the data-generating mechanism had a true \(\beta_1\) of zero, compared the realized TIE rate to the nominal \(\alpha = .05\) level.

We performed our analysis on all combinations of the following data-generating parameters:

  • total number of clusters \(K\in \{10, 20, 40, 100\}\), divided evenly among the two treatment arms

  • subjects per cluster \(N \in \{ 3, 5, 10, 20, 50\}\)

  • \(\sigma_b^2 \in \{0.001, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5\}\)

  • \(\sigma^2\) was fixed at 1 after we found that the ratio of \(\sigma_b^2\) to \(\sigma^2\) (equivalently, the ICC) mattered more than the absolute values of the variances.
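
With \(\sigma^2\) fixed at 1, the \(\sigma_b^2\) grid above corresponds directly to a grid of ICC values via \(\rho = \sigma^2_b / (\sigma^2_b + \sigma^2)\):

```python
sigma_b2_grid = [0.001, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5]
iccs = [s / (s + 1.0) for s in sigma_b2_grid]  # sigma^2 fixed at 1
print([round(r, 4) for r in iccs])
# [0.001, 0.0099, 0.0196, 0.0476, 0.0909, 0.1667, 0.3333]
```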

For each combination, we performed 10,000 simulations and gathered p-values for \(\beta_1\) in two different ways. First, we used the model-based Wald test p-values reported by the software, fitting with restricted maximum likelihood (REML). Second, we generated p-values from the same software using the likelihood ratio test (LRT), this time fitting with maximum likelihood (ML).
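
With 10,000 replicates per scenario, Monte Carlo error in an estimated TIE rate near the nominal level is small; under binomial sampling at \(\alpha = .05\):

```python
import math

alpha, nsim = 0.05, 10_000
mc_se = math.sqrt(alpha * (1 - alpha) / nsim)  # binomial Monte Carlo SE
lo, hi = alpha - 1.96 * mc_se, alpha + 1.96 * mc_se
print(round(mc_se, 5), round(lo, 4), round(hi, 4))
# 0.00218 0.0457 0.0543
```

Estimated TIE rates outside roughly (0.046, 0.054) are therefore unlikely to be simulation noise.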

The model-based p-values came from the following:

  • SAS PROC MIXED, with the default settings

  • SAS PROC MIXED, using the Kenward-Roger method of calculating degrees of freedom (DDFM=KR), which includes a small-sample correction to the estimated fixed-effects covariance matrix

  • The lme() function of the nlme package in R, which generates p-values from a t-test with \(K-2\) degrees of freedom (df)

  • The lmerTest package, which generates p-values for models fit using lme4, using the Satterthwaite df method

  • The pbkrtest package, which generates p-values for models fit using lme4, using the Kenward-Roger df method

The LRT p-values came from SAS PROC MIXED, lme4, and nlme, using the default settings with ML estimation. [Discuss the df choices made by lmerTest, nlme, and SAS here.]
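
The LRT compares the maximized ML log-likelihoods of the fitted models with and without the treatment term; with one constrained parameter, the statistic is referred to a \(\chi^2_1\) distribution. A sketch of that final step (Python for illustration; the \(\chi^2_1\) survival function reduces to the complementary error function):

```python
import math

def lrt_pvalue_1df(loglik_full, loglik_null):
    """P-value for a likelihood ratio test with 1 df.
    For chi-square with 1 df, P(X > x) = erfc(sqrt(x / 2))."""
    stat = max(2.0 * (loglik_full - loglik_null), 0.0)
    return math.erfc(math.sqrt(stat / 2.0))

# The classical 5% critical value for chi-square(1) is about 3.841,
# so a log-likelihood improvement of ~1.92 yields p close to 0.05.
print(round(lrt_pvalue_1df(1.9207, 0.0), 3))
```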

Finally, we also used [fill in details about the SAS Kenward-Roger correction here].

Results

[Figures to be added: all results combined; plots separated by method; plots separated by software; and plots fully separated by method and software.]

Discussion

References

Huang, Susan S., Edward Septimus, Ken Kleinman, Julia Moody, Jason Hickok, Taliser R. Avery, Julie Lankiewicz, et al. 2013. “Targeted Versus Universal Decolonization to Prevent ICU Infection.” New England Journal of Medicine 368 (24): 2255–65. https://doi.org/10.1056/NEJMoa1207290.

Moon, Rachel Y., Fern R. Hauck, Eve R. Colson, Ann L. Kellams, Nicole L. Geller, Timothy Heeren, Stephen M. Kerr, et al. 2017. “The Effect of Nursing Quality Improvement and Mobile Health Interventions on Infant Sleep Practices: A Randomized Clinical Trial.” JAMA 318 (4): 351–59. https://doi.org/10.1001/jama.2017.8982.

Vinereanu, Dragos, Renato D. Lopes, M. Cecilia Bahit, Denis Xavier, Jie Jiang, Hussein R. Al-Khalidi, Wensheng He, et al. 2017. “A Multifaceted Intervention to Improve Treatment with Oral Anticoagulants in Atrial Fibrillation (IMPACT-AF): An International, Cluster-Randomised Trial.” The Lancet 390 (10104): 1737–46. https://doi.org/10.1016/S0140-6736(17)32165-7.